[AutoRound] Support GLM-Image W4A16 quantization model #3059
lvliang-intel wants to merge 8 commits into
hsliuustc0106 left a comment:
BLOCKING:
- Documentation: the AutoRound documentation table should be updated. Please add GLM-Image to the supported models table in docs/user_guide/diffusion/quantization/autoround.md:

| GLM-Image | Intel/GLM-Image-int4-AutoRound | W4A16 | 128 | GPTQ-Marlin |

Please add the latency test results as well.
I will update the performance test results soon.
Thanks for the reminder. The documentation has been updated.
Please add the necessary unit test cases.
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Unit tests added.
Can you try with a longer sequence?
Sure, I will run the performance test with a longer sequence.
Merge conflicts need fixing before review.
…o feats/ar-w4a16-glm-image Signed-off-by: lvliang-intel <liang1.lv@intel.com>
| Model | Scope | Status | Notes |
|-------|-------|--------|-------|
| BAGEL | Checkpoint-defined diffusion or transformer stage | Not validated | Requires a compatible AutoRound checkpoint |
| GLM-Image | Checkpoint-defined diffusion or transformer stage | Not validated | Requires a compatible AutoRound checkpoint |

Suggested change:
| GLM-Image | Checkpoint-defined diffusion or transformer stage | ✅ | AutoRound checkpoint name |

@yenuo26 PTAL, thanks.
    quantization configs for W4A16/AutoRound quantization support.
    """

    from unittest.mock import MagicMock
It is recommended to use pytest-mock here instead of importing from unittest.mock directly.
    if stage_config_path:
        gen_kwargs["stage_configs_path"] = stage_config_path

    with OmniRunner(model_name, seed=42, **gen_kwargs) as runner:
Maybe you can use the omni_runner fixture here.
    first_output = outputs[0]
    assert first_output.final_output_type == "image"
    req_out = first_output.request_output
    assert isinstance(req_out, OmniRequestOutput) and hasattr(req_out, "images")
This test is better suited to the nightly suite.
1. Please rename the script to xxxx_expansion.py.
2. Please change advanced_model to full_model.
3. Please add this test to test-nightly.yml.
    ``transformers.models.glm_image`` at module init, which may not be available
    in all environments.
    """
Please add pytest.mark.xxxxx.
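For illustration, applying markers might look like the sketch below. The marker names `full_model` and `diffusion` and the test name are placeholders chosen for this example; the reviewer's `xxxxx` is left unspecified, so the project's registered markers should be used in their place.

```python
import pytest

# A module-level `pytestmark` applies the listed marks to every test in the file.
# The marker name is a placeholder, not a confirmed project marker.
pytestmark = [pytest.mark.full_model]


# A decorator applies a mark to a single test. Again a placeholder name.
@pytest.mark.diffusion
def test_glm_image_quant_smoke():
    assert True
```

With markers in place, a subset can be selected on the command line with `pytest -m <marker_name>`.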
Purpose
Support GLM-Image W4A16 AutoRound quantization in vLLM-Omni, extending the existing AutoRound W4A16 infrastructure (originally built for FLUX and Qwen3-Omni) to the GLM-Image diffusion model. This reduces model size by roughly 4x (about 3.8× for the overall checkpoint) and lowers the GPU memory footprint while largely preserving generation quality.
https://huggingface.co/Intel/GLM-Image-int4-AutoRound
Related: #1325, #1777, #2670
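As background, W4A16 means weights are stored in 4 bits while activations stay in 16 bits. The sketch below shows group-wise 4-bit weight quantization in plain Python, using the group size of 128 from the documentation row quoted in the review. It is only an illustration under stated assumptions: it uses simple round-to-nearest with an asymmetric zero point, whereas AutoRound tunes the rounding, and the actual checkpoint format may differ. All function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest asymmetric quantization of one weight group (sketch)."""
    qmax = (1 << bits) - 1                      # 15 for 4-bit codes
    lo = min(min(weights), 0.0)                 # include 0 so the zero point is a valid code
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0             # fall back to 1.0 for an all-zero group
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize one weight row in consecutive groups, each with its own scale."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


if __name__ == "__main__":
    row = [((i * 37) % 101 - 50) / 50 for i in range(256)]   # synthetic weights
    for codes, scale, zp in quantize_row(row):
        restored = dequantize_group(codes, scale, zp)
        print(f"scale={scale:.4f} zero_point={zp} first={restored[0]:.3f}")
```

Keeping one scale per 128 weights bounds the rounding error of each weight by roughly half of that group's scale, which is why smaller groups trade storage for accuracy.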
Key changes:
- Replace all nn.Linear / ColumnParallelLinear / RowParallelLinear projection layers in the GLM-Image DiT with their vLLM quantization-aware counterparts (ReplicatedLinear, ColumnParallelLinear, and RowParallelLinear constructed with quant_config).
- Add .contiguous() calls before RowParallelLinear (required for FP8/W4A16 kernels) and tuple-unpack the ReplicatedLinear output.
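The two smaller changes can be sketched in plain PyTorch as follows. `TupleLinear` is a hypothetical stand-in that only mimics the calling convention assumed here, namely that the vLLM linear layers return an `(output, bias)` tuple; it is not the vLLM class, and `OutputProjectionSketch` is not code from this PR.

```python
import torch
from torch import nn


class TupleLinear(nn.Module):
    """Stand-in for a vLLM-style linear layer whose forward returns (output, bias)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.proj = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.proj(x), None


class OutputProjectionSketch(nn.Module):
    """Hypothetical block showing .contiguous() plus tuple unpacking."""

    def __init__(self, hidden_size):
        super().__init__()
        self.to_out = TupleLinear(hidden_size, hidden_size)

    def forward(self, seq_first):
        # (seq, batch, hidden) -> (batch, seq, hidden): a strided view, not a copy.
        hidden = seq_first.transpose(0, 1)
        # Assumption: the quantized kernel expects a dense row-major buffer,
        # so materialize one before calling the projection.
        hidden = hidden.contiguous()
        # Unpack the tuple rather than treating the result as a tensor.
        out, _ = self.to_out(hidden)
        return out


if __name__ == "__main__":
    block = OutputProjectionSketch(hidden_size=8)
    print(block(torch.randn(3, 2, 8)).shape)
```

Forgetting the unpacking step typically surfaces later as an attribute or type error on a tuple, so it is worth checking every call site when swapping layer classes.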
Test Plan
E2E offline inference tests added.
TIIF-Bench accuracy evaluation test.
DPG-Bench accuracy evaluation test.
Test Result
TIIF-Bench Accuracy (9 Sub-Attributes Average)
Summary:
Average accuracy drop: ~1.3%. The degradation is minimal and within an acceptable range for 4-bit quantization.
Model Size Reduction
Overall checkpoint is ~3.8× smaller
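As a rough sanity check on this figure, the arithmetic below estimates the storage cost per weight for group-wise 4-bit quantization. It assumes a group size of 128 with one 16-bit scale and one 4-bit zero point per group, compared against a 16-bit baseline. These metadata sizes are assumptions, and a real checkpoint also contains layers kept in higher precision, so the result is an estimate for the quantized linear layers only and not an explanation of the measured number.

```python
def effective_bits_per_weight(weight_bits=4, group_size=128,
                              scale_bits=16, zero_point_bits=4):
    """Bits stored per weight once per-group metadata is amortized."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def compression_ratio(baseline_bits=16, **kwargs):
    """Size ratio of a 16-bit layer to its group-quantized version."""
    return baseline_bits / effective_bits_per_weight(**kwargs)


bits = effective_bits_per_weight()
ratio = compression_ratio()
print(f"{bits:.3f} bits/weight, {ratio:.2f}x smaller than 16-bit")
# prints: 4.156 bits/weight, 3.85x smaller than 16-bit
```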
E2E Generation Smoke Test
The quantized W4A16 model maintains full pipeline functionality with no critical degradation in generation behavior.
Performance Test Result on A100